1. Introduction

1.1 Overview and Motivation

Vaccination is currently one of the most debated issues worldwide. Vaccination campaigns are run in very different ways from country to country, and recurring conflicts about their fairness are undermining the policies that governments have adopted in the attempt to recover from the pandemic. Such a heterogeneous pool of reactions is peculiar and raises some emblematic questions: why do some people seem more willing to get vaccinated? Why are some countries struggling in the process? Are there unconscious forces in the individual mind preying on people’s fears? History teaches us that people’s fickleness can be molded in several ways: affecting their educational level, undermining their well-being, and limiting their freedom are just a few of the many possible examples.

From the Italian perspective, public opinion in recent days has been strongly divided over the national vaccination campaign, which is closely linked to the possibility of returning to an almost normal life. There have been numerous demonstrations, which have resulted in violent clashes on more than one occasion. Although the problem may seem “trivial”, the reality is quite different. Through an articulated system of restrictions and passes (the so-called Green Pass), the government led by Prof. Mario Draghi has introduced an implicit vaccination requirement, splitting both the country and its politics in two. The social conflict also stems from a polarization fed by the Italian media, which have apparently contributed to creating a rift between those who are “educated and responsible” and the “ignorant irresponsible”, turning the choice of whether or not to vaccinate into a personal issue rather than a social one. In this climate of social tension, political leaders have not been able to present a compact and decisive front to the population, often taking ambiguous positions that, unfortunately, have resulted in real misinformation, not only during the pandemic but also in previous years.

Aim: Our project question is motivated by the wish to shed light on the jungle of nebulous information that people are exposed to when facing the seemingly trivial decision of getting vaccinated.


1.3 Project Objectives

The purpose of the project is to provide the reader with a wide set of tools that a policymaker can use in assessing the efficacy of a specific vaccination campaign. In other words, we will not limit our research question to the mere descriptive analysis of a single vaccination campaign, which has already been discussed in depth; rather, we will try to emphasize all the factors affecting the progress of a potential vaccination campaign. Level of education, cultural heritage, political beliefs, welfare, and past covid-related experiences are just a few of the factors that, we argue, could play a crucial role in shaping people’s decisions and current beliefs about the vaccination campaign. A prescient policymaker should thus be able to design an efficient vaccination campaign by taking into account all the sources of potential distress that can arise in a specific situation of interest. In addressing such complex questions, we started from the highly specific case of Italy and analyzed the ongoing Italian framework in depth. Italy is showing a very heterogeneous response in the overall vaccination campaign: some regions seem more willing to get vaccinated, while others still struggle. Given the centralized system of vaccine supply, we may assume that some additional and partly hidden factors shape this willingness to get vaccinated and consequently influence the progress of the campaign.

Summing up: the project objective is not limited to a descriptive analysis of the ongoing vaccination campaign; more ambitiously, it tries to shed light on all the factors that positively or negatively affect people’s willingness to get vaccinated.

Starting from the evidence of the Italian case, the final outcome would be the proposal of a case-based framework for vaccination policy design that accounts for all the correlated factors and could potentially be applied to other countries worldwide. The potential benefits of this type of research are large and widespread, especially for countries that are still lagging behind in their vaccination campaigns and for developing countries that have not yet been able to engage fully with this area of policy.

Also, with the benefit of hindsight, the considerations we deal with might be of great relevance when facing future pandemic challenges. What could we do next? Is it possible to build a better pandemic plan for the future?


1.4 Research questions

The project will be structured as follows:

The general framework and background

  • How has Italy been affected by covid-19, both at a national and regional level? What is the ongoing state of the vaccination campaign? Are there any differences between the 20 regions? How can these differences shape people’s preferences?

Getting deeper into the analysis

  • Before the vaccination campaign, the Italian government took on several responsibilities in terms of restrictions with respect to the mobility of citizens, limiting their movements both in time and physically/geographically. Since these are the alternative to mass vaccination, when and how effective are they?

  • Starting from the assumption (which will be demonstrated) that some Italian regions have suffered much more heavily from the effects of the pandemic, can we consider the psychological/traumatic aspect as a driver for a greater propensity to vaccination? By looking at the death/case ratio in areas strongly hit by the pandemic, are we able to find a positive correlation with the propensity of getting vaccinated?

  • Could we explain a higher propensity to vaccination in the single regions by using the number of graduates of the same area as a proxy for the level of education shown by the individuals?

  • Is there a link between the economic situation of the regions and the number of people who decide to get vaccinated? Could wealth and well-being explain people’s perceptions?

  • Is there a connection between fewer government restrictions and the willingness to get vaccinated? Are people who suffer from social restriction more prone to go against the government?

  • Suppose we use the illiteracy index per region as an indicator of the qualitative variable “cultural differences”. Could this be an explanation of the reasons why people refuse to get vaccinated?




2. Data

2.1 Sources

Most of the data were taken from the official ISTAT website. ISTAT, the Italian National Institute of Statistics, maintains a comprehensive national data warehouse that Italian experts, statisticians, and scholars use for their daily analyses. Some data on COVID measures were also taken from the data.world website. We made this choice because the data are exactly the same as those provided by the Italian government, with the useful difference that they are presented jointly in a daily format, which is more practical for later analysis. Below we report the links from which we took the variables of interest for our current analysis:

Pool of useful Dataframe links

  • Vaccination Dataframe
  • Inequality Dataframe
  • Covid Dataframe
  • Political Information Channels Dataframe
  • Vaccine Shipments to Italy Dataframe
  • Employment Dataframe
  • Educational Level Dataframe
  • Population Dataframe
  • Income Dataframe

2.1.1 Description

Our analysis rests on a pool of 9 different tables covering the macro areas of interest, from covid and vaccination in a broad sense to more explanatory variables such as regional GDP, the poverty rate, a measure of inequality between individuals, the employment level, the educational level, etc.

A first overview of our pool of datasets is presented below and may serve as a reference for the next sections of the analysis, which go more deeply into the research question.

Some insights

  • Vaccine Dataset: 140,305 observations, 15 columns
  • Gini Index of inequalities table: 55 entries, 14 columns
  • Covid Dataset: 12,706 entries, 31 columns
  • Main Channels of information: 1,729 entries, 11 columns
  • Vaccine Shipments to Italy Dataframe: 5,201 observations, 5 columns
  • Occupational Level table: 7,898 entries, 15 columns
  • Education Level Dataset: 1,585 entries, 17 columns
  • Population Dataset: 21 entries, 15 columns
  • Income Dataset: 21 entries, 13 columns

Some of the datasets, such as the Population and Poverty ones, contain fewer observations because their data are presented by macro-area (south, north, center, islands) rather than by region; a certain degree of data wrangling, in terms of macro-aggregation, will therefore be required when dealing with these variables. Still, we consider them very valuable, as southern and northern Italian regions tend to exhibit similar features within their respective groups.

Last, we report some of the relevant variables that we think could be a good starting point to work with as follows:

Incoming Familiar Variables
age_group area date gender_sex region
producer vaccine_furniture first_dose second_dose total_hospitalised
total_positive new_positive death total_cases swabs
death_ratio swab_ratio mortality_rate studies_title employment_rate
gini_index vax_ratio tot_population tot_no_education_ratio tot_graduated_ratio

Note: we have changed the names of some variables to ease understanding.


2.2 Wrangling and cleaning

2.2.1 Vaccination Dataframe

We started our analysis by focusing on the data for individuals who received either two doses of a vaccine or a single dose of the Janssen (Johnson & Johnson) vaccine. Our purpose was to find a suitable proxy for the overall number of fully vaccinated subjects.

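library(dplyr)  # provides %>%, select(), filter() and the joins used below

# Daily vaccine administrations from the official open-data repository
df_vax <- read.csv(url("https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/somministrazioni-vaccini-latest.csv"))
df_vax <- df_vax %>%
  select(-c(3, 4, 9:13))  # drop columns not needed for the analysis

# Janssen requires one shot only, so its first doses count as full vaccinations
df_janssen <- df_vax %>%
  filter(fornitore == "Janssen")

# Sum doses per region; aggregate() names the summed column "x"
second_dose <- aggregate(df_vax$seconda_dose, by = list(area = df_vax$nome_area), FUN = sum)
janssen_dose <- aggregate(df_janssen$prima_dose, by = list(area = df_janssen$nome_area), FUN = sum)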

In order to study the heterogeneity of the vaccination campaign, we added a further geographical distinction based on the macro-area of the vaccinated subjects. The mutate command fitted our needs here.

vaccination_region <- second_dose %>% 
  left_join(janssen_dose, by=("area")) %>%
  rename(second_dose= `x.x` , janssen=`x.y` , region=area)

vaccination_region2 <- tibble(vaccination_region$region) %>%
  rename(region=`vaccination_region$region`)%>%
  mutate(area= c("South", "South", "South", "South", "North", 
                 "North", "Center", "North", "North",
                 "Center", "South", "North", "North",
                 "North", "South", "South", "South",
                 "Center", "Center", "North", "North")) %>%
  arrange(area)
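
The positional vector of area labels above only lines up with the regions if the rows arrive in the expected alphabetical order. A minimal sketch of a less order-dependent alternative follows, assuming a hypothetical `area_lookup` table keyed by region name (it is not part of the original code, and only three regions are shown for brevity):

```r
library(dplyr)

# Hypothetical lookup table: one row per region, matched by name, not position.
area_lookup <- data.frame(
  region = c("Abruzzo", "Lazio", "Veneto"),
  area   = c("South", "Center", "North")
)

# Toy stand-in for the aggregated doses, deliberately not in alphabetical order.
doses <- data.frame(
  region      = c("Veneto", "Abruzzo", "Lazio"),
  second_dose = c(300, 100, 200)
)

# Joining by name attaches the right area regardless of row order.
doses_area <- doses %>%
  left_join(area_lookup, by = "region") %>%
  relocate(area, .after = region)
```

With a lookup of this kind, a region that is missing or spelled differently shows up as an NA in area instead of silently receiving a neighbor's label.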

We started by creating our very first vaccination dataframe. It contained the aggregate number of subjects vaccinated either with two doses or with the Janssen vaccine (which requires only one shot). The results were thus organized by region and by macro-area.

vaccination_region <- vaccination_region %>%
  left_join(vaccination_region2, by=("region"))%>%
  relocate(area, .after = region)

Further insights: The initial dataset covers the period from 27 December 2020 to today. As we were interested in computing summary statistics of the above-mentioned variables, we made use of the aggregate command, which splits the data into subsets, computes a summary statistic for each, and returns the result in a convenient form.
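
To illustrate what aggregate does, here is a small sketch on made-up numbers. The column names mirror those of the vaccination file; `toy_vax` and `toy_second` are illustrative names only:

```r
# Toy data mimicking the structure of the vaccination file (values are made up).
toy_vax <- data.frame(
  nome_area    = c("Lazio", "Lazio", "Veneto", "Veneto", "Veneto"),
  seconda_dose = c(10, 20, 5, 5, 15)
)

# aggregate() splits seconda_dose by nome_area, applies sum() to each subset,
# and returns one row per group with the result in a column named "x".
toy_second <- aggregate(toy_vax$seconda_dose,
                        by  = list(area = toy_vax$nome_area),
                        FUN = sum)
```

The default column name x is the reason why, after joining two such results, the columns appear as x.x and x.y and need to be renamed.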


2.2.2 Covid Dataframe

As far as the Covid dataframe is concerned, we decided to download it directly from data.world, which provides daily updates on the ongoing COVID pandemic in Italy. As before, we eliminated the columns that were redundant for the purpose of our analysis and renamed the remaining ones in a more intuitive fashion. The variables of interest in this case were different: from the familiar date (in a daily format) and region to the total hospitalised cases, the total positives, the number of new positives, the number of deaths and the swabs performed.

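library(jsonlite)  # provides fromJSON()

# Daily regional COVID figures from data.world
df.regional.covid <- fromJSON("https://query.data.world/s/r6heradt54t2pjttgjfqumjxp6rhjj")

# Drop redundant columns and give the remaining ones English names
df.regional.covid <- select(df.regional.covid, -2, -3, -(5:8), -10, -12, -14, -16, -17, -(20:30)) %>% 
  rename(date=data,region=denominazione_regione,total_hospitalised=totale_ospedalizzati,total_positives=totale_positivi, 
         new_positives=nuovi_positivi,deaths=deceduti,total_cases=totale_casi,swabs=tamponi)

# Logical matrix flagging missing values, kept for inspection
i <- is.na(df.regional.covid)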
 

Since we wanted to work with the most recent data, we arranged the dataframe by descending date and used the slice command to keep the 21 most recent rows, that is, the latest daily record for each region and autonomous province. Because the retained counts (deaths, total cases, swabs) are cumulative, this gives us totals to date rather than daily values, which will be crucial for our later analysis.

df.total.per.region <- arrange(df.regional.covid, desc(date))%>% 
  slice(1:21)

df.total.per.region = select(df.total.per.region, -1,-(3:5)) 
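
Slicing the first 21 rows relies on every date carrying exactly one row per region. A minimal sketch of an alternative that selects the latest snapshot by date instead is shown below; the toy values are made up, and the date column is assumed to be either a Date or an ISO-formatted string so that max() picks the most recent day:

```r
library(dplyr)

# Toy daily data with cumulative case counts (values are made up).
toy_covid <- data.frame(
  date        = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  region      = c("Lazio", "Veneto", "Lazio", "Veneto"),
  total_cases = c(100, 200, 110, 230)
)

# Keep only the rows of the most recent date, then drop the date column.
latest <- toy_covid %>%
  filter(date == max(date)) %>%
  select(-date) %>%
  arrange(region)
```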

Similarly to the Vaccination Dataframe, we used mutate to add a new column holding the macro-area (North, Center or South), with the idea of easing a later merge that summarizes all the information in a single dataframe.

df.total.per.region <- df.total.per.region %>%
  mutate(area = case_when(
    region %in% c("Veneto", "Emilia-Romagna", "P.A. Trento", "P.A. Bolzano",
                  "Friuli Venezia Giulia", "Lombardia", "Liguria", "Piemonte",
                  "Valle d'Aosta") ~ "North",
    region %in% c("Lazio", "Marche", "Toscana", "Umbria") ~ "Center",
    region %in% c("Abruzzo", "Basilicata", "Calabria", "Molise", "Campania",
                  "Puglia", "Sicilia", "Sardegna") ~ "South"
  )) %>%
  relocate(area, .after = region)

df.total.per.region [6, 1] <- "Friuli-Venezia Giulia"
df.total.per.region [12, 1] <- "Provincia Autonoma Bolzano / Bozen"
df.total.per.region [13, 1] <- "Provincia Autonoma Trento"
df.total.per.region [20, 1] <- "Valle d'Aosta / Vallée d'Aoste"

df.total.per.region <- df.total.per.region %>% 
  arrange(region)
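
The four assignments above address rows by position, so they depend on the row order produced by the earlier steps. A hedged sketch of a name-based alternative follows; `region_labels` and `toy_regions` are hypothetical names introduced only for illustration, and just two of the four labels are mapped:

```r
library(dplyr)

# Hypothetical mapping: names are the labels in the covid data,
# values are the labels used by the other tables.
region_labels <- c(
  "Friuli Venezia Giulia" = "Friuli-Venezia Giulia",
  "P.A. Trento"           = "Provincia Autonoma Trento"
)

toy_regions <- data.frame(
  region = c("P.A. Trento", "Lazio", "Friuli Venezia Giulia")
)

# Replace a label only when it appears in the mapping; leave the others as is.
toy_regions <- toy_regions %>%
  mutate(region = ifelse(region %in% names(region_labels),
                         region_labels[region],
                         region))
```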


2.2.3 Employment Level Dataframe

As far as the Employment dataset is concerned, we removed redundant columns as before. For the purpose of the analysis, we decided to keep the variables for region, sex, age class, study title and, of highest interest, the employment level. As the dataset also contained aggregated values, we cleaned it by keeping only individual regions and the age class of highest interest, namely subjects from 15 to 64 years old.


df.oc <- read.csv("../data/TassoOccupazione.csv")
df.oc <- df.oc %>% select(-c(1, 3:5, 7, 9, 12, 14, 15)) %>%
  dplyr::rename(
    region = Territorio,
    sex = Sesso,
    ageclass = Classe.di.età,
    studytitle = Titolo.di.studio,
    oc_level = Value
  )

p <- is.na.data.frame(df.oc)

df.regions <-
  df.oc %>% filter(
    !(
      region == "Italia" |
        region == "Nord" |
        region == "Nord-est" |
        region == "Nord-ovest" |
        region == "Centro" |
        region == "Mezzogiorno" |
        region == "Sud"
    ),
    ageclass == "15-64 anni" ,
    TIME == "2019" | TIME == "2020"
  ) %>%
  mutate(time = as.numeric(TIME))

We then argued that a good summary variable for employment would be the mean employment level across educational classes within each region. To compute it, we used the summarise command after grouping by region and time. Notice that, as our analysis is based on the vaccination campaign launched at the end of 2020, we chose to keep only the 2020 values.

df_occ <- df.regions %>%
  filter(!TIME == "2019",!region == "Trentino Alto Adige / Südtirol") %>%
  select(region, studytitle, oc_level, TIME) %>%
  group_by(region, TIME) %>%
  summarise_at(vars(oc_level), list(m_oc_reg = mean)) %>%
  arrange(desc(m_oc_reg)) %>%
  pivot_wider(names_from = TIME, values_from = m_oc_reg) %>%
  dplyr::rename(occ_level = "2020")
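
The grouping step can be sketched on toy data as follows (the values are made up and the column names follow the renamed employment dataframe). Note that this is an unweighted mean: every education class counts equally, whatever the number of people it contains.

```r
library(dplyr)

# Toy employment rates by region and study title (values are made up).
toy_emp <- data.frame(
  region     = c("Lazio", "Lazio", "Veneto", "Veneto"),
  studytitle = c("primary", "tertiary", "primary", "tertiary"),
  oc_level   = c(40, 80, 50, 90)
)

# One row per region holding the mean employment level across study titles.
toy_mean <- toy_emp %>%
  group_by(region) %>%
  summarise(occ_level = mean(oc_level), .groups = "drop") %>%
  arrange(desc(occ_level))
```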


2.2.4 Inequality Dataframe

The Inequality dataset was not big, but it was full of useful information. After loading it and selecting the information we were looking for (region, macro-area and Gini index), we kept only the values that include imputed rents, as they would better reflect a proxy for inequalities between Italian regions.

df_gini <- read.csv("../data/IndiceGini.csv") 

df_gini<- df_gini%>%
  select(c(2,8,9,11))%>% 
  filter(!(Including.or.not.including.imputed.rents=="not including imputed rents"| Territory=="Italy"| 
             Territory=="Centro (I)"|Territory=="Isole"| Territory=="Sud"| 
             Territory=="Nord-ovest"| Territory=="Nord-est")) %>% 
  arrange(Territory) %>% 
  dplyr::rename(region=Territory)


df_gini2 <- tibble(df_gini$region)

df_gini2["area"] <- c("South", "South", "South", "South", "North", 
                      "North", "Center", "North", "North",
                      "Center", "South", "North", "North",
                      "North", "South", "South", "South",
                      "Center", "Center", "North", "North")

df_gini2 <- df_gini2 %>%
  dplyr::rename(region=`df_gini$region`)

df_gini <- df_gini %>%
  select(-c(2,3)) %>%
  left_join(df_gini2, by=("region")) %>%
  relocate(area, .after = region)%>%
  dplyr::rename(gini_index=Value)

Further insights: The Gini index, or Gini coefficient, is a measure of the distribution of income across a population developed by the Italian statistician Corrado Gini in 1912. The most up-to-date measure of inequality we found for Italian regions is the one provided by the official ISTAT portal. Note that it refers to the year 2018, so it does not contain the effects of the ongoing pandemic. We still decided to use it on the assumption that the COVID crisis affected the economies of the Italian regions almost uniformly, so that using a pre-COVID measure would not introduce a relevant bias.
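
For readers unfamiliar with the index, the following sketch computes the textbook Gini coefficient as the mean absolute difference between all pairs of incomes divided by twice the mean income. It is only an illustration of the definition on made-up values; the helper `gini()` is ours, and the published ISTAT figures may be computed differently (for example with survey weights).

```r
# Textbook Gini coefficient: sum of |x_i - x_j| over all ordered pairs,
# divided by 2 * n^2 * mean(x). 0 means perfect equality.
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

equal_incomes   <- gini(c(10, 10, 10, 10))  # everyone earns the same
unequal_incomes <- gini(c(0, 0, 0, 100))    # one person earns everything
```

In the second case the index equals (n - 1) / n = 0.75, the maximum attainable with four individuals under this formula.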


2.2.5 Population Dataframe

The amount of work we did on this dataframe was fairly small, but it ended up being very valuable for the purpose of the analysis. Having a variable for the regional population is fundamental when comparing results, as it allows us to reason in relative terms and not only in absolute ones, which may be misleading.

df_pop <- read.csv("../data/Popolazione.csv") %>%
  select(-c(1, 3:12, 14, 15)) %>%
  dplyr::rename(region = Territory, tot_pop = Value)


2.2.6 Educational Dataframe

The dataframe on Education plays an important role in displaying the level of educational attainment by region. We decided to focus on the population aged 15 years and over and on the usual 2020 period. Notice that we converted the values, which are reported in thousands, into absolute numbers of people for ease of visualization. We also filtered out rows such as “Trentino Alto Adige / Südtirol”, which is already accounted for by considering “Provincia Autonoma Bolzano / Bozen” and “Provincia Autonoma Trento” independently.

df_educ <- read.csv("../data/Education.csv")
df_educ <- df_educ[-c(1, 3, 5, 7:9, 11, 12, 14, 16, 17)] %>%
  filter(
    !(
      Gender == "total" |
        Highest.level.of.education.attained == "total" |
        Territory == "Trentino Alto Adige / Südtirol"
    )
  ) %>%
  dplyr::rename(
    ed_people = Value,
    ed_level = Highest.level.of.education.attained,
    Time = TIME,
    region = Territory
  )

df_educ["ed_people"] = df_educ["ed_people"] * 1000

As we were interested in people who have reached a higher level of education, we also filtered for those who have successfully completed tertiary education. As in the previous sections, we grouped by region and kept the 2020 values.

df_educ1 <- df_educ %>%
  filter(ed_level == "tertiary (university, doctoral and specialization courses)",
         Time == "2020") %>%
  group_by(region, Time) %>%
  summarise(uni_people = sum(ed_people)) %>%
  pivot_wider(names_from = "Time", values_from = "uni_people") %>%
  dplyr::rename(tot_graduated = "2020")

education_area <- tibble(df_educ1$region)

education_area["area"] <-
  c("South","South","South","South","North",
    "North","Center","North","North","Center",
    "South","North","North","North","South",
    "South","South","Center","Center","North","North")

education_area <- education_area %>%
  rename(region = `df_educ1$region`) %>%
  left_join(vaccination_region2, by = (c("region", "area")))

df_educ1 <- df_educ1 %>%
  left_join(education_area, by = ("region")) %>%
  relocate(area, .after = region)

Furthermore, we also decided to account for people with little or no formal education. The code and the rationale are exactly the same as before, with the difference that the initial filter now selects those who have at most a primary school certificate.

df_educ2 <- df_educ %>%
  filter(ed_level == "primary school certificate, no educational degree",
         Time == "2020") %>%
  group_by(region, Time) %>%
  summarise(no_ed_people = sum(ed_people)) %>%
  pivot_wider(names_from = "Time", values_from = "no_ed_people") %>%
  rename(tot_no_ed = "2020")


2.2.7 Political Information Channels Dataframe

We decided to include this dataset because information is a serious problem in Italy. Italian information channels are not always clear, and we argue that poor information may discourage people from getting vaccinated. This dataset concerns the main channels through which people inform themselves about politics. As political choices undeniably shape our well-being and decisions, there may be a correlation between poor information channels and lower vaccination rates.

In terms of wrangling, we performed the now familiar selection step and kept region-specific rows only, as the initial dataset also contained macro-area results. Moreover, we filtered for the newspapers variable, as it could be a fair proxy for verifiable, good-quality channels of information.

df_infochannel<- read.csv("../data/MezziInfo.csv")

df_infochannel_r <- df_infochannel %>% select(-c(1, 3, 5, 8, 10, 11)) %>%
  filter(
    !(
      Territory == "Nord" |
        Territory == "Nord-est" |
        Territory == "Nord-ovest" |
        Territory == "Mezzogiorno" |
        Territory == "Centro (I)" |
        Territory == "Italy" |
        Territory == "10,001 - 50,000 inhab." |
        Territory == "2,001 - 10,000 inhab." |
        Territory == "until 2,000 inhab." |
        Territory == "50,001 inhab. and over"
    ),
    Data.type == "newspapers" ,
    Measure == "thousands value",
    TIME == "2020"
  ) %>%
  mutate(reported_number = Value * 1000) %>%
  select(-c(3, 5)) %>%
  filter(
    !(
      Territory == "metropolitan area - centre" |
        Territory == "metropolitan area - suburbs" |
        Territory == "Trentino Alto Adige / Südtirol" |
        Territory == "Isole" | Territory == "Sud"
    )
  ) %>%
  select(-c(2)) %>%
  dplyr::rename(news_read = reported_number, region = Territory)

The resulting infochannel dataset has the following variables: region, TIME and news_read.


2.2.8 Income per Italian Region


df_income <- read.csv("../data/Reddito.csv") %>%
  select(-c(1, 3, 4, 5 , 6 , 7, 8, 10, 12, 13)) %>%
  rename(region = Territory, Income= Value) %>%
  select(!TIME)%>%
  arrange(Income)

The dataframe on income by region helps us determine which Italian regions are most relevant from an economic perspective and understand whether wealth influences variables such as swabs performed or cases registered (the assumption being that the pandemic entered the “wealthier” regions with greater ease). The operations performed on the dataset were fairly straightforward, namely eliminating redundant columns and sorting the rows by income, in ascending order.


2.2.9 Vaccine Shipments to Italy

app_vax <-
  read.csv(
    url(
      "https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/consegne-vaccini-latest.csv"
    )
  )

app_vax <- app_vax %>%
  select(-c(5:7)) %>%
  dplyr::rename(
    date = data_consegna,
    region = nome_area,
    vaccines = numero_dosi,
    producer = fornitore
  )

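# Keep only rows whose date starts with a four-digit year, then convert from
# character to Date; filtering the rows first keeps the lengths aligned
i4 <- grepl("^[0-9]{4}", app_vax$date)
app_vax <- app_vax[i4, ]
v4 <- as.Date(app_vax$date)
app_vax$date <- v4

str(app_vax)  # str() prints the structure; its return value is NULL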
#> 'data.frame':    5392 obs. of  5 variables:
#>  $ area    : chr  "ABR" "ABR" "ABR" "ABR" ...
#>  $ producer: chr  "Pfizer/BioNTech" "Pfizer/BioNTech" "Pfizer/BioN"..
#>  $ vaccines: int  135 7800 3900 3900 3900 4875 1300 3510 3510 2340 ..
#>  $ date    : Date, format: "2020-12-27" ...
#>  $ region  : chr  "Abruzzo" "Abruzzo" "Abruzzo" "Abruzzo" ...

app_vax <- app_vax %>%
  arrange(area, date) %>%
  group_by(region, producer)

The “vaccine shipments” dataframe collects the data on the supply of vaccines by the Italian Government, from December 2020 up to today, the date on which the report is compiled. As with the datasets on the covid situation and on the administration of vaccines, we believe it is important to work with constantly updated data in order to have a clear and precise view of the scenario in which Italy is moving.

At the “operational” level, the dataframe had a structure that did not lend itself well to our needs, so we fixed some aspects. First, we selected the columns of interest and converted the date from the “character” format to the “date” format, in order to be able to carry out temporal analyses. Then, we rearranged the overall structure of the dataframe so that our analysis could be performed more quickly, for example by converting the dates from daily to monthly (this operation, however, is reported in the exploratory analysis part of this work).
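
As an illustration of the daily-to-monthly conversion mentioned above, a minimal base R sketch could look as follows. It is our own assumption of how such a step might be written, not the code used in the exploratory analysis; the toy rows reuse the column names of app_vax.

```r
# Toy shipments with daily dates (a handful of illustrative rows).
toy_ship <- data.frame(
  date     = as.Date(c("2020-12-27", "2021-01-05", "2021-01-20", "2021-02-03")),
  region   = c("Abruzzo", "Abruzzo", "Abruzzo", "Abruzzo"),
  vaccines = c(135, 7800, 3900, 3900)
)

# Collapse each date to its year-month label, e.g. "2021-01".
toy_ship$month <- format(toy_ship$date, "%Y-%m")

# Total doses delivered per region and month.
monthly <- aggregate(vaccines ~ region + month, data = toy_ship, FUN = sum)
```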


2.2.10 Poverty Dataframe


df_poverty<-read.csv("../data/Poverta.csv")
df_poverty<- df_poverty %>% select(-c(1,4,6,8,9)) %>%
  dplyr::rename(region=Territorio,povertyname=TIPO_DATO8, povertyindex= Value) %>%
  filter(region=="Nord" | region=="Nord-est" |region=="Nord-ovest"|region=="Centro"|region=="Mezzogiorno", povertyname == "INTENS_POVREL_FAM")


r<-df_poverty %>% 
  filter(povertyname == "INTENS_POVREL_FAM", TIME=="2020") %>%
  group_by(region, povertyname,TIME) %>%
  summarise_at(vars(povertyindex), list(m_poverty_ar = mean)) %>%
  arrange(desc(m_poverty_ar))%>%
  pivot_wider(names_from = TIME, values_from = m_poverty_ar) 

kable(r)%>%
  kable_styling(font_size = 14)

region        povertyname         2020
Mezzogiorno   INTENS_POVREL_FAM   22.7
Nord-ovest    INTENS_POVREL_FAM   21.5
Nord          INTENS_POVREL_FAM   20.4
Nord-est      INTENS_POVREL_FAM   18.8
Centro        INTENS_POVREL_FAM   18.1

The Poverty dataframe is a rich source of knowledge about the shape of the Italian economy. We mainly used it to build a clear background picture of how poverty is spread across Italian macro-areas. To do so, we focused on the measure of relative poverty of families by macro-area. Unsurprisingly, this measure is higher in the South (Mezzogiorno) than in the North, a clear symptom of existing divergences within the country.


2.3 The beauty of the final Dataframe

After all this wrangling, filtering and selecting, we felt the need to reorganize our ideas. To this end, we decided to collect all this valuable information into a single final dataset. In doing so, we made heavy use of the left_join command, which fitted our needs perfectly. Starting from the initial vaccination dataframe, we added the columns of each variable of interest, matching by region and by macro-area.


vaccination_region <- vaccination_region %>%
  left_join(df.total.per.region, by = c("region", "area"))

vaccination_region <- vaccination_region %>%
  left_join(df_educ1, by = c("region", "area"))

vaccination_region <- vaccination_region %>%
  left_join(df_educ2, by = "region")

vaccination_region <- vaccination_region %>%
  left_join(df_pop, by = "region")

vaccination_region <- vaccination_region %>%
  left_join(df_gini, by = (c("region", "area")))

vaccination_region <- vaccination_region %>%
  left_join(df_occ, by = "region")

vaccination_region <- vaccination_region %>%
  left_join(df_income, by = "region")

vaccination_region <- vaccination_region %>%
  left_join(df_infochannel_r, by="region")%>%
  relocate(TIME, .after = area)
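
Because left_join keeps every row of the left table and fills unmatched ones with NA, a region label that is spelled differently in two tables goes unnoticed. A minimal sketch of a check with anti_join, on toy tables of our own making, is:

```r
library(dplyr)

# Toy tables whose region labels differ for one region (values are made up).
left_tbl <- data.frame(
  region      = c("Lazio", "Friuli Venezia Giulia", "Veneto"),
  second_dose = c(1, 2, 3)
)
right_tbl <- data.frame(
  region  = c("Lazio", "Friuli-Venezia Giulia", "Veneto"),
  tot_pop = c(10, 20, 30)
)

# Rows of the left table that have no partner in the right table.
unmatched <- anti_join(left_tbl, right_tbl, by = "region")
```

Running such a check before each join lists the regions that would otherwise end up with missing values in the final dataframe.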

An interested reader may now start to see the picture more clearly, but some variables may still be puzzling. This is where the population dataset loaded in the previous section comes in handy, as it allows us to see the picture in relative terms. We thus used the transform command to add variables expressed relative to population, which will be central to our later arguments. The new variables are: the vaccination ratio for both the two-dose vaccines and Janssen, the share of graduates, the share of people with no education, income relative to population, swabs, cases and deaths relative to population, the mortality rate, and newspapers read relative to population.

vaccination_region <- vaccination_region %>%
  transform(
    vax_ratio = second_dose / tot_pop,
    vax_ratioj = janssen / tot_pop,
    tot_grad_ratio = tot_graduated / tot_pop,
    tot_no_ed_ratio = tot_no_ed / tot_pop,
    income_ratio= Income /tot_pop,
    swabs_ratio = swabs / tot_pop,
    cases_ratio = total_cases / tot_pop,
    death_ratio = deaths / tot_pop,
    mortality_rate = deaths / total_cases,
    news_ratio= news_read/tot_pop
  )
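
The difference between the two death-related measures can be seen on a made-up example: two hypothetical regions, A and B, with the same number of deaths but different populations and case counts.

```r
# Two hypothetical regions (all figures are made up).
toy <- data.frame(
  region      = c("A", "B"),
  deaths      = c(100, 100),
  total_cases = c(2000, 5000),
  tot_pop     = c(100000, 1000000)
)

# death_ratio relates deaths to the whole population,
# mortality_rate relates deaths to registered cases only.
toy <- transform(toy,
                 death_ratio    = deaths / tot_pop,
                 mortality_rate = deaths / total_cases)
```

Region A ends up with a death ratio of 0.001 and a mortality rate of 0.05, region B with 0.0001 and 0.02: identical absolute numbers can hide very different relative pictures.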


The Final Dataframe:




3. Exploratory data analysis

3.1 Covid cases in Italy


Before observing and making assumptions about the Italian vaccination campaign, it is necessary to focus on the trend of the coronavirus epidemic in the country. The following graph shows that the regions have been affected by the virus in a non-uniform way: the different colors represent the number of Covid cases per region relative to its population (updated to the last population census of January 1, 2021). This representation is consistent with the events that occurred at the beginning of 2020: although the first two cases on Italian soil were registered in the city of Rome (Lazio), these were attributable to tourists, who presumably had less close contact with the outside world.

The “patient zero” of Italian nationality was identified on February 21, 2020 at the Hospital of Lodi (Lombardia), and only one day later the existence of an outbreak in Vo’ (Veneto) was brought to light. As a matter of fact, the northern regions stand out clearly in color from the rest of the country.


3.2 Covid deaths in Italy


The considerations made for Graph 1 are reflected in the second graph, which, following the same methodology, shows that the number of deaths relative to the population is higher in the regions of Northern Italy, where the greatest number of cases has been recorded overall. The earlier arrival of the virus in those areas of the country has influenced the number of deaths recorded to date.


3.3 A different perspective on Covid cases in Italy



Departing from the regional view, this interactive chart (all observed dates can be browsed using the bar below the chart) shows the trend of Covid cases from the first day of tracking, 25/02/2020, to the present, as the data are constantly updated. The interactivity allows the observer to draw several conclusions about both the progress of the Italian epidemic and the effectiveness of the measures put in place. In chronological order, the first notable point is March 21, 2020, when the epidemiological curve (represented here by the daily trend) began to decrease, indicating that the lockdown measures taken by the Italian government, first in February for the northern regions and then on March 9, 2020 for the whole country, were starting to give their first results. On May 4, 2020 and then on June 15, 2020 the measures were progressively loosened; with the arrival of summer, infections remained stable until the middle of August.

The gradual but steady increase that followed did not alarm Italians until the beginning of October 2020: on October 10, 5724 cases were recorded, and only 15 days later the daily cases were 21273. It is evident that, with the beginning of the autumn season and without preventive measures, the virus found much more fertile ground to spread. The pandemic reached its peak in Italy just nineteen days later, on November 13, with 40902 daily cases. The regions were divided into color-coded tiers according to the epidemiological trend, and a curfew was instituted from 22:00 to 5:00. We will return to the data in this chart later, as Italy kicked off its immunization campaign between December 2020 and January 2021.


3.5 Cases vs. Deaths & Swabs vs. Cases (Interactive Bubble Charts)



The following charts are presented in pairs because we believe it is important to compare the data they contain. The first (graph 5) shows a positive correlation between the number of cases and the number of deaths, once again underlining the gap between the regions of the North and those of the Center-South. One caveat applies, however. The number of deaths is unlikely to differ much from reality, since people with serious symptoms are taken to emergency rooms, where a swab is performed for diagnosis. The same cannot be said of cases: not everyone affected by Covid-19 develops symptoms severe enough to require hospitalization, and some develop no symptoms at all (the so-called asymptomatic cases). It follows that the number of swabs carried out per region is particularly important, which leads us to the second graph (graph 6). This graph also shows a positive correlation, this time between the number of swabs and the number of cases, which is in a sense obvious: the more swabs are carried out, the more positive cases are found.

Seen in these terms, the data presented above (i.e., those showing that the northern regions were the most affected) take on a different meaning: it is still true that the pandemic began in those regions, but it is also true that they carried out the greatest number of swabs on the population. Consequently, the contagion data from the regions of Northern Italy are more indicative of the real situation than those of the South. There is a clear outlier in these representations, namely the Autonomous Province of Bolzano/Bozen; we refer the reader to the end of this chapter, specifically section 3.10, for considerations on this issue.
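One way to make this argument checkable is to relate cases to swabs directly, for example through the share of swabs that turn out positive. The sketch below assumes a data frame with the columns total_cases, swabs and tot_pop used earlier. The three rows are invented placeholders, and the positivity column is an illustrative addition that is not part of the original data frame.

```r
# Toy data: all figures are hypothetical placeholders, not project data.
toy <- data.frame(
  region      = c("North (toy)", "Center (toy)", "South (toy)"),
  tot_pop     = c(1000000, 1000000, 1000000),
  swabs       = c(1500000, 1000000, 600000),
  total_cases = c(90000, 60000, 42000),
  stringsAsFactors = FALSE
)

toy <- transform(
  toy,
  swabs_ratio = swabs / tot_pop,        # testing intensity
  cases_ratio = total_cases / tot_pop,  # recorded cases per inhabitant
  positivity  = total_cases / swabs     # share of swabs that are positive
)

# Regions ordered from the most to the least tested.
print(toy[order(-toy$swabs_ratio),
          c("region", "swabs_ratio", "cases_ratio", "positivity")])

# Pearson correlation between testing intensity and recorded cases.
swabs_cases_cor <- cor(toy$swabs_ratio, toy$cases_ratio)
print(swabs_cases_cor)
```

In the toy data the most tested region records the most cases per inhabitant while its positivity is no higher than elsewhere, which is the pattern one would expect if part of the North-South gap in recorded cases reflects testing intensity.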


3.6 Vaccination Campaign in Italy: a winning approach?


Graphs 7 and 8 show the trend of the Italian vaccination campaign for the second and the first dose, respectively. The campaign started on December 27, 2020, with the administration of 7313 first doses, while the administration of second doses started on January 17, 2021, with 2987 doses. From the trend of the two graphs it is clear that Italy opted for a vaccination plan aimed at administering the highest possible number of both first doses and booster doses (i.e. second doses), given the supplies delivered by the pharmaceutical companies. The data in graph 8 refer to all types of vaccine administered (Janssen, Moderna, Pfizer and Vaxzevria), while graph 7 considers only the vaccines administered in two doses (Moderna, Pfizer and Vaxzevria). Numerical evidence of the approach that Italy adopted towards the vaccination campaign can be found in the “peaks” of the two graphs: for the first dose the peak was recorded on June 4, 2021, and for the second dose about six weeks later, on July 15, 2021.

Field                              2020-11-10T17:00:00   2021-11-10T17:00:00
country                            ITA                   ITA
hospitalized_with_symptoms         28633                 3447
intensive_care                     2971                  423
total_hospitalised                 31604                 3870
quarantined                        558506                98989
total_positives                    590110                102859
total_variation_of_positive_cases  16776                 2654
new_positives                      35098                 7891
discharged_from_hospital           363023                4591328
deaths                             42330                 132551


At the moment, this strategy seems to have been broadly correct: on November 10, 2020, 35098 new cases were recorded in Italy, while on the same day of 2021 there were just 7891, with only 3870 people hospitalized compared with 31604 a year earlier. This suggests that the vaccination campaign has played a central role in easing the pressure on the national health system (SSN), which in 2020 seemed to be on the verge of collapse.
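The year-on-year comparison can be expressed as percentage changes. The sketch below uses only the figures reported above for November 10 of 2020 and 2021; the snapshot data frame and the pct_change helper are illustrative names introduced here, not part of the original analysis.

```r
# Figures copied from the two national rows reported above (Italy).
snapshot <- data.frame(
  date               = as.Date(c("2020-11-10", "2021-11-10")),
  new_positives      = c(35098, 7891),
  total_hospitalised = c(31604, 3870),
  intensive_care     = c(2971, 423)
)

# Percentage change from the first row (2020) to the second row (2021),
# rounded to one decimal place. Hypothetical helper for illustration.
pct_change <- function(x) round(100 * (x[2] - x[1]) / x[1], 1)

# Apply the helper to every column except the date.
changes <- sapply(snapshot[, -1], pct_change)
print(changes)
```

A comparison between two single days is only indicative: it does not by itself separate the effect of vaccination from that of other measures or of seasonal differences.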


3.7 (Cont’d) Vaccination Campaign in Italy: a winning approach?

As shown in graph 9 above, the analysis of the Italian vaccine-acquisition policy shows that the government is relying on mRNA-based vaccines. In fact, the graph shows a substantial decline in the number of doses available for injection for both the Janssen and the Vaxzevria vaccines. In the Janssen case, this is mainly because the Ministry of Health (in accordance with the European Union) decided, before and during the campaign, not to purchase large amounts of this product. Moreover, according to the scientific community, it is not as effective as the other vaccines, but this is merely an observation, as our goal is not to investigate these complicated “issues”.

On the other hand, the Vaxzevria case is more interesting. It is clear from the graph that initially there was a high demand for this product by the Italian authorities, with a rapid change of trend at the beginning of July. Looking at what was happening in those days, public opinion was harshly questioning the efficacy and safety of AstraZeneca’s product, leaving fertile ground for mass hysteria, as people started refusing the Anglo-Swedish vaccine. From that moment on, the fate of AZ was sealed, with the Italian Government deciding to donate the doses that were still kept in freezers to the Covax program.


3.8 Income per Region

The graph above shows that the regions of the North and especially of the Center are richer than those of the South. This greater wealth has allowed the regions in question to put in place more efficient contact tracing campaigns, which result in a share of positives relative to the population that is clearly higher than in the southern regions, which have been “lazier” in this respect. Another fact that has already emerged clearly from our analysis is that, looking at the data on education, there is a large disparity between North and South. We will not try to explain this evidence, because it is not the objective of this paper; we refer the reader to the in-depth literature on the so-called “southern question”.


3.9 Overview of ratio factors in the final df as Heatmap

With the heatmap presented above, our goal is to provide a view of the final data frame, at the regional level, that is even clearer than what has been said so far. Before presenting our considerations, we would like to explain how this heatmap should be read: lighter colors (e.g. yellow) represent a value that deviates positively from the general trend of that variable across the 21 regions considered, while darker colors (e.g. purple) indicate a value that deviates negatively from the trend. In producing the heatmap, the software normalizes the data to make the visualization effective. For our part, we have modified the order in which the regions are arranged with respect to the final data frame “vaccination_region”: here it is useful to group them by geographical area (the order is Center, North, South, from the top to the bottom of the map).

We note once again that the data of the North first, and then of the Center, differ from those of the South: this strengthens the assumption made at the beginning of this document that the pandemic spread in a “cascade” from North to South, giving rise to results that, to date, are not homogeneous. This chart also supports the hypothesis that the regions of the North have been more efficient in so-called contact tracing (that is, the number of swabs carried out as a percentage of the population).

With respect to the vaccination campaign, the map shows that some regions of Southern Italy (Sicily, Campania and Calabria) are lagging behind those of the Center-North. The distribution of colors in the map also highlights another interesting factor: the most affected regions (North) have higher vaccination rates than those in the South, which were less affected by the infection. It also seems that the cases/population ratio has influenced the willingness of the regions to expand their testing campaigns: once again, in the Center and especially in the North, the swabs carried out on the population are much more numerous.
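The normalization and row ordering described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: it assumes that the normalization is a per-variable z-score (the text does not say which normalization the software applies), it uses base R's heatmap() rather than whichever plotting function produced the chart, and the four regions, the area column and all values are invented.

```r
# Toy data: regions, areas and values are hypothetical placeholders.
toy <- data.frame(
  region      = c("R1", "R2", "R3", "R4"),
  area        = c("North", "Center", "South", "North"),
  vax_ratio   = c(0.80, 0.76, 0.70, 0.78),
  swabs_ratio = c(2.1, 1.4, 0.9, 4.5),
  cases_ratio = c(0.09, 0.07, 0.05, 0.12),
  stringsAsFactors = FALSE
)

# Group the rows by geographical area in the order Center, North, South.
toy$area <- factor(toy$area, levels = c("Center", "North", "South"))
toy <- toy[order(toy$area, toy$region), ]

# Assumed normalization: centre and scale each variable (column) separately,
# so that colors show deviations from that variable's own mean.
ratio_cols <- c("vax_ratio", "swabs_ratio", "cases_ratio")
z <- scale(as.matrix(toy[, ratio_cols]))
rownames(z) <- as.character(toy$region)

# Draw the matrix as is: no clustering and no further rescaling.
heatmap(z, Rowv = NA, Colv = NA, scale = "none",
        col = hcl.colors(25, "viridis"))
```

Normalizing each column separately is what makes variables on very different scales, such as swabs_ratio and cases_ratio, comparable within the same map.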


3.10 Our findings so far


Considering all the steps of the analysis from the beginning of this work until now, we can summarize the considerations and evidence gathered so far. The coronavirus epidemic has followed an uneven course across the regions of the Italian peninsula. We explain this with the arrival and tracing of the first cases in the northernmost regions which, being the most “economically exposed” part of the country, were more subject to frequent travel from the areas of the world where the contagion began.

We have noticed that these data seem, at first glance, to play a role in the vaccination campaign, which proceeds more rapidly in the northern part of the country. The Italian vaccination campaign seems to have been effective to date, despite the various “problems” that have arisen and were discussed above, such as the “information short-circuit” around the AstraZeneca vaccine. Overall, the current state of the vaccination campaign can be evaluated positively, especially if we take into account the large population of the individual regions. Some regions have clearly lagged behind Italy’s average vaccine/population ratio, and so far we have highlighted the first evidence found using factors such as income and education. We will try to bring out these data even more clearly as the project progresses. For more in-depth considerations on the vaccination campaign, we invite the reader to refer to the specific paragraphs above.

As a final point, we feel it is important to emphasize that at several points in our analysis the Autonomous Province of Bolzano/Bozen is an outlier compared to all other Italian regions. Such different data, especially regarding the number of swabs performed (the swabs/population ratio is 4.5), could be explained by the geographical location of the Autonomous Province of Bolzano/Bozen. It is in fact the most important crossroads in Italy for the movement of goods, which, we must remember, underwent a drastic decrease in daily volumes but never came to a halt. We therefore hypothesize that the testing centers had a very high influx of users precisely because of the need to prevent Covid-positive workers in this sector from moving from Italy to Austria and the rest of Europe (or vice versa). This is only a first explanation, which may be refined in the future.
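An outlier of this kind can also be flagged numerically rather than by eye. The sketch below applies the common 1.5 x IQR rule, which is a technique swapped in here for illustration and not the method used in this analysis. Only the value 4.5 for Bolzano/Bozen comes from the text; the other regions and their values are invented.

```r
# swabs/population ratios: only the Bolzano value (4.5) is from the text,
# the remaining entries are hypothetical placeholders.
swabs_ratio <- c(
  "Region A" = 1.2, "Region B" = 1.5, "Region C" = 0.9,
  "Region D" = 1.8, "Region E" = 1.1, "P.A. Bolzano" = 4.5
)

# Upper and lower fences of the 1.5 x IQR rule.
q <- quantile(swabs_ratio, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * IQR(swabs_ratio)
lower_fence <- q[1] - 1.5 * IQR(swabs_ratio)

# Names of the regions falling outside the fences.
outliers <- names(swabs_ratio)[swabs_ratio > upper_fence |
                               swabs_ratio < lower_fence]
print(outliers)
```

With the real vaccination_region data frame, the same rule could be applied to its swabs_ratio column to check whether other regions fall outside the fences.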